StreamInterface: prevent socket/reader-thread leak on handshake failure#918
StreamInterface: prevent socket/reader-thread leak on handshake failure#918nightjoker7 wants to merge 1 commit intomeshtastic:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses a resource-leak failure mode in StreamInterface where exceptions during initial protocol handshake (connect()/waitForConfig()) could leave a running reader thread and an open underlying transport (serial/TCP), even though the constructor raised before returning an instance to the caller.
Changes:
- Wrap
connect()+waitForConfig()inStreamInterface.__init__with exception handling that attempts toclose()before re-raising. - Make
StreamInterface.close()tolerateRuntimeErrorfrom joining an unstarted reader thread.
| # Nothing to join; safe to ignore. | ||
| pass |
There was a problem hiding this comment.
If connect() fails before the reader thread is started (e.g., serial write raises before _rxThread.start()), this RuntimeError path will skip joining but also won’t close the underlying stream. For SerialInterface (which opens the port before calling StreamInterface.__init__), this means the serial port can remain open even though close() was invoked. Consider explicitly closing self.stream (and setting it to None) when the thread was never started / is not alive, instead of just pass.
| # Nothing to join; safe to ignore. | |
| pass | |
| # In that case the reader thread cannot run _disconnected(), | |
| # so close the stream here to avoid leaking the resource. | |
| if self.stream is not None: | |
| with contextlib.suppress(Exception): | |
| self.stream.close() | |
| self.stream = None |
| if self._rxThread != threading.current_thread(): | ||
| self._rxThread.join() # wait for it to exit | ||
| try: | ||
| self._rxThread.join() # wait for it to exit | ||
| except RuntimeError: |
There was a problem hiding this comment.
join() without a timeout can block indefinitely if the reader thread is stuck in a blocking _readBytes() (e.g., TCPInterface uses blocking socket.recv() with no timeout). With the new __init__ exception handler calling self.close(), a handshake timeout against a silent TCP peer can hang here and never re-raise the original exception. Consider using a bounded join (and letting transport-specific close() logic interrupt the read), or otherwise ensuring blocking reads are interrupted before joining.
| try: | ||
| self.connect() | ||
| if not noProto: | ||
| self.waitForConfig() | ||
| except Exception: | ||
| # Handshake failed (timeout, config error, bad stream). The caller | ||
| # never receives a reference to this half-initialized instance, so | ||
| # they cannot call close() themselves. If we don't clean up here, | ||
| # the reader thread (already started by connect()) keeps running | ||
| # and the underlying stream/socket leaks — the leak compounds on | ||
| # every retry from the caller's reconnect loop. | ||
| with contextlib.suppress(Exception): | ||
| self.close() | ||
| raise |
There was a problem hiding this comment.
New failure-path behavior (calling close() from __init__ when connect() / waitForConfig() raises) is important for leak prevention but currently isn’t covered by unit tests. Since there are existing tests for stream_interface.py, consider adding a test that forces connect() or waitForConfig() to raise and asserts cleanup occurs (e.g., close() is invoked and doesn’t raise when the thread wasn’t started).
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #918 +/- ##
==========================================
- Coverage 61.57% 61.52% -0.06%
==========================================
Files 25 25
Lines 4448 4457 +9
==========================================
+ Hits 2739 2742 +3
- Misses 1709 1715 +6
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…re in __init__ If connect() or waitForConfig() raises during __init__ (handshake timeout, bad stream, config error), the reader thread started by connect() keeps running and the underlying stream/socket stays open — but the caller never receives a reference to the half-initialized instance, so they cannot call close() themselves. The leak compounds on every retry from a caller's reconnect loop. Fix: wrap connect() + waitForConfig() in try/except; call self.close() on any exception before re-raising. Also guard close() against RuntimeError from joining an unstarted reader thread (happens when close() runs from a failed __init__ before connect() could spawn it). Discovered while debugging a real-world Meshtastic firmware crash where a passive logger's retrying TCPInterface() calls against a node with 250-entry NodeDB produced a reconnect storm — every retry triggered a full config+NodeDB dump on the node, compounding heap pressure, which then exposed null-deref bugs in Router::perhapsDecode / MeshService (firmware side fixed in meshtastic/firmware#10226 and #10229). The client-side leak is independent of those firmware bugs and worth fixing on its own.
9011baa to
bfe38ac
Compare
Problem
If
StreamInterface.__init__callsconnect()orwaitForConfig()and either raises (handshake timeout, config error, bad stream), the partially-constructed object propagates the exception up — but the reader thread spawned inconnect()keeps running and the underlying stream/socket stays open. The caller never received a reference to the half-initialized instance, so they cannot callclose()themselves.In a reconnect loop (every Python client retrying after a flaky connection), each failed attempt leaks one socket + one thread. Compounds quickly under load.
This is consistent with the symptom shape in meshtastic/firmware#10101 (Station G2 TCP drops with MeshMonitor): the MeshMonitor client retries, retries leak fds, eventually the OS pressure shows up as connection refusal until the leaks time out.
Fix
Wrap the
__init__handshake calls in try/except. On failure, callself.close()(suppressed insidecontextlib.suppress(Exception)so a secondary error doesn't shadow the original), then re-raise.Also handle the case where
close()is called from the cleanup path beforeconnect()started the reader thread —_rxThread.join()raisesRuntimeErrorwhen called on a never-started thread.Tests added
Four unit tests in
test_stream_interface.py:close()is safe when the reader thread was never started__init__cleanup runs whenconnect()raises, original exception propagates__init__cleanup runs whenwaitForConfig()raises, original exception propagatesclose()during cleanup does not shadow the original handshake failureAll passing locally, exercise the previously-uncovered patch lines.
Diff
meshtastic/stream_interface.py: +9 / -2 (the actual fix)meshtastic/tests/test_stream_interface.py: +95 / 0 (the tests)